Abstract

This paper discusses the application of L1-regularized maximum entropy modeling, or SL1-Max [9],
to multiclass categorization problems. A new modification to the SL1-Max fast sequential learning
algorithm is proposed to handle conditional distributions. Furthermore, unlike most previous studies,
the present research goes beyond a single type of conditional distribution. It describes and compares a
variety of modeling assumptions about the class distribution (independent or exclusive) and various types
of joint or conditional distributions. This results in a new methodology for combining binary regularized
classifiers to achieve multiclass categorization. In this context, maximum entropy can be considered
a generic and efficient regularized classification tool that matches or outperforms the state of the art
represented by AdaBoost and SVMs.



1 Introduction 

A new form of maximum entropy (maxent) with a sequential updating procedure and L1 regularization
(SL1-Max) was recently introduced [9] as a probability distribution estimation technique. This study adapts
SL1-Max to classification problems. It demonstrates a regularized linear classification algorithm that bears
striking similarities with large margin classifiers such as AdaBoost fTO'.'Tl. 

Conditional maxent models [4] (also known as conditional exponential or logistic regression models)
were previously applied to classification problems in text classification. These models were shown [10]
to be a generalization of Support Vector Machines (SVMs) [20] or a modification of AdaBoost normalized
to form a conditional distribution [13]. The three aforementioned references employed the L2 type of
regularization. L1 regularization was proposed for logistic regression [15], which is a particular case of
maximum entropy. The application of conditional maxent to part-of-speech tagging or machine translation
problems [17] can also be seen as a classification problem, where the number of classes is very large.
Solutions dealing with the computational and memory usage issues arising from this large number of classes
were proposed for translation applications. Most of these studies focus on specific applications. We could
not find studies of maxent as a generic classification algorithm that can be applied to a wide range of
problems.

This situation can be contrasted with the literature on large margin classifiers, where extant studies [2]
cover their adaptation to multiclass problems. Large margin classifiers were initially demonstrated on binary
classification problems and extended to multiclass classification through various schemes combining these
binary classifiers. The simplest scheme is to train binary classifiers to distinguish the examples belonging to
one class from the examples not belonging to this class. This approach is usually referred to in the literature
as 1-vs-other or 1-vs-all. Many other combination schemes are possible, in particular 1-vs-1, where each
classifier is trained to separate one class from another. More general combination schemes include error
correcting output codes (ECOC) [8] and hierarchies of classifiers.

Our goal is to include maxent among the regularized classification algorithms one would routinely
consider, and implement it in a software package that would be as easy to use as SVM and AdaBoost packages.
The expected advantage of maxent over other classification algorithms is its flexibility, both in terms of
choice of distribution and modeling assumptions. SL1-Max provides the ideal starting point for this work:
this algorithm estimates maxent model parameters in a fast sequential manner, and supports an effective and
well understood L1 regularization scheme (which leads to sparser solutions than L2 regularization). The new
contributions in this paper are the following. We adapt SL1-Max to conditional distributions, which requires
the derivation of a new bound on the decrease in the loss. We compare the joint and the class-conditional
distributions to the conditional distribution traditionally considered in the literature. We introduce
"non-class" maxent models to reduce multiclass problems to a set of binary problems (to our knowledge, such
techniques have only been used in question answering systems [18]). We show, through experiments, that
the statistical interpretation of maxent leads to a new methodology for selecting the optimal multiclass
approach for a given application.

Section 2 introduces the notation to handle classification problems. In Section 3, we adapt the SL1-Max
algorithm to estimate parameters of the joint, class-conditional and conditional distributions. In Section 4,
we generalize these techniques to the multi-label case. After a discussion about the implementation in
Section 5, comparative experiments are provided in Section 6.



2 Definitions and notation 

Our sample space covers input-label associative pairs $(x, c) \in X \times \{1, \dots, l\}$. Our goal is to determine,
for a given input x, the most likely class label c* which maximizes the unknown conditional distribution:
$c^* = \arg\max_c p(c|x)$. For simplicity, this paper initially focuses on classification where each input is
associated with a single label. From a statistical viewpoint, this means that classes are exclusive (i.e., they
cannot occur simultaneously). Models where multiple labels are allowed will be considered later.
Applying Bayes' rule gives

\[ p(c|x) = \frac{p(x, c)}{p(x)} = \frac{p(x|c)\, p(c)}{p(x)} \tag{1} \]

As p(x) does not impact the choice of the class, we can choose which distribution we want to estimate: joint
with p(x, c), conditional with p(c|x), or class-conditional with p(x|c). The rest of this section introduces
notation to manipulate these distributions in a consistent fashion.

In maxent, training data is used to impose constraints on the distribution. Each constraint expresses a
characteristic of the training data that must be learned in the estimated distribution p. Typical constraints
require features to have the same expected value as in the training data. Features are real valued functions
of the input and of the class, f(x, c). To represent the average of a feature f over a distribution p, we use the
following notation:

Joint: The expected value of f under p is

\[ p[f] = \sum_{i,c} p(x_i, c)\, f(x_i, c). \]

Conditional: For a given training example $x_i$, $p_i(c) = p(c|x_i)$ and

\[ p_i[f] = \sum_c p(c|x_i)\, f(x_i, c) = \sum_c p_i(c)\, f(x_i, c). \]

Class-conditional: For a given class c, $p_c(x) = p(x|c)$ and

\[ p_c[f] = \sum_x p(x|c)\, f(x, c) = \sum_x p_c(x)\, f(x, c). \]

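The training data is a set of input-label pairs $(x_1, c_1), \dots, (x_m, c_m)$, and is a subset of the sample space
$\{x_1, \dots, x_m\} \times \{1, \dots, l\}$. The empirical distributions over this training set are defined as follows:

\[ \tilde{p}(x, c) = \frac{1}{m} \left| \{ 1 \le i \le m : x_i = x \text{ and } c_i = c \} \right| \]

\[ \tilde{p}(x|c) = \frac{\left| \{ 1 \le i \le m : x_i = x \text{ and } c_i = c \} \right|}{\left| \{ 1 \le i \le m : c_i = c \} \right|} \]

\[ \tilde{p}(c) = \frac{1}{m} \left| \{ 1 \le i \le m : c_i = c \} \right| \]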
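The three empirical distributions above amount to simple counting. The following is a minimal Python sketch, assuming hashable inputs; the function name `empirical_distributions` and the dictionary-based representation are illustrative choices, not part of the original work.

```python
from collections import Counter

def empirical_distributions(samples):
    """samples: list of (x, c) training pairs with hashable x.

    Returns three dictionaries: the joint p~(x, c), the class-conditional
    p~(x|c) (keyed by (x, c)), and the class prior p~(c).
    """
    m = len(samples)
    pair_counts = Counter(samples)                  # |{i : x_i = x and c_i = c}|
    class_counts = Counter(c for _, c in samples)   # |{i : c_i = c}|
    joint = {pair: n / m for pair, n in pair_counts.items()}
    class_cond = {(x, c): n / class_counts[c] for (x, c), n in pair_counts.items()}
    prior = {c: n / m for c, n in class_counts.items()}
    return joint, class_cond, prior

samples = [("a", 1), ("b", 1), ("a", 2), ("a", 1)]
joint, class_cond, prior = empirical_distributions(samples)
```

By construction, the joint value for any pair equals the class-conditional value times the class prior, which is the empirical counterpart of Bayes' rule.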

All maxent models are based on the computation of a linear score over the features, represented by the
inner product between the feature vector and the weight vector $\lambda$. In the classification case, the feature vector
f(x, c) is defined over all input-label pairs (x, c). This pair is scored with the inner product $\lambda \cdot f(x, c)$ and
compared to other pairs (x, d) with $d \neq c$. A subtlety that arises in the application of maxent to classification
problems is the need to replicate each feature as many times as there are classes. Suppose the input x is a
list of n values $v_1(x), \dots, v_n(x)$; the class-dependent features are defined as follows:

\[ f_{d,j}(x, c) = \begin{cases} v_j(x) & \text{if } c = d \\ 0 & \text{otherwise} \end{cases} \tag{2} \]

With this representation, the inner product between the feature vector f(x, c) and the parameter vector $\lambda$
simplifies as

\[ \lambda \cdot f(x, c) = \sum_{d,j} \lambda_{d,j} f_{d,j}(x, c) = \sum_j \lambda_{c,j} v_j(x) = \lambda_c \cdot v(x), \]

where $\lambda_c$ is the subset of parameters specific to class c.
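The simplification above can be checked numerically. The sketch below, in plain Python, builds the class-dependent vector of Eq. (2) and verifies that the full inner product equals the per-class product. The helper names `class_features` and `dot` are hypothetical, and classes are indexed from 0 here for convenience.

```python
def class_features(v, c, num_classes):
    """Expand the input values v(x) into the class-dependent vector f(x, c) of Eq. (2)."""
    n = len(v)
    f = [0.0] * (num_classes * n)
    f[c * n:(c + 1) * n] = v   # block d = c holds v(x); every other block stays zero
    return f

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

v = [1.0, 0.0, 2.0]
lam = [[0.5, -1.0, 0.25], [0.0, 2.0, -0.5]]    # one weight row lambda_c per class
flat_lam = [w for row in lam for w in row]      # the full parameter vector lambda

for c in range(2):
    # lambda . f(x, c) and lambda_c . v(x) must agree
    assert abs(dot(flat_lam, class_features(v, c, 2)) - dot(lam[c], v)) < 1e-12
```

In practice only the right-hand form is worth computing, since the expanded vector is mostly zeros.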



3 Trying Different Distributions 

This section provides SL1-Max solutions for the estimation of the joint, class-conditional and conditional
distributions. It also shows some limitations of these solutions that will be overcome in the next section.
The first subsection focuses on the joint distribution p(x, c).

3.1 Estimating joint distributions

Maximum entropy restricts the trained model distribution p so that each feature has the same expected and
empirical means. Our notation summarizes this constraint as $\tilde{p}[f_{d,j}] = p[f_{d,j}]$, where $\tilde{p}$ is the empirical
distribution. In the regularized case, this constraint is softened to have the form
$|\tilde{p}[f_{d,j}] - p[f_{d,j}]| \le \beta_{d,j}$, where $\beta_{d,j}$ is a regularization parameter.
Within these constraints, we are looking for the distribution which is the closest to the uniform
distribution, by maximizing the entropy $H(p) = -\sum_{i,c} p(x_i, c) \ln p(x_i, c)$. This corresponds to the convex
program:

\[ P_1: \quad \max_p H(p) \quad \text{subject to} \quad \sum_{i,c} p(x_i, c) = 1, \qquad \forall d, j: \; |\tilde{p}[f_{d,j}] - p[f_{d,j}]| \le \beta_{d,j} \]

The dual program maximizes the likelihood over the exponential distributions:

\[ Q_1(\lambda): \quad \min_\lambda L_\beta^{\tilde{p}}(\lambda) \quad \text{with} \quad L_\beta^{\tilde{p}}(\lambda) = \underbrace{-\tilde{p}[\ln q_\lambda]}_{\text{Likelihood}} + \underbrace{\sum_{d,j} \beta_{d,j} |\lambda_{d,j}|}_{\text{Regularization}}, \qquad q_\lambda(x, c) = \frac{e^{\lambda \cdot f(x, c)}}{Z_\lambda}, \qquad Z_\lambda = \sum_{c,i} e^{\lambda \cdot f(x_i, c)} \]

Note that for the joint distribution, the $Z_\lambda$ normalization is performed over all classes and all training
samples. The convergence of a sequential-update algorithm that modifies one weight at a time is proved in [9].
This coordinate-wise descent is particularly efficient when dealing with a large number of sparse features.
If $\lambda'$ denotes the parameters obtained by adding $\delta$ to the single weight $\lambda_{d,j}$, a bound on the decrease in
the loss is

\[ L_\beta^{\tilde{p}}(\lambda') - L_\beta^{\tilde{p}}(\lambda) \le -\delta\, \tilde{p}[f_{d,j}] + \ln\left(1 + (e^\delta - 1)\, q_\lambda[f_{d,j}]\right) + \beta_{d,j} \left( |\lambda_{d,j} + \delta| - |\lambda_{d,j}| \right), \tag{3} \]

with equality if we have binary features. The values of $\delta$ that minimize this expression can be obtained in
closed form. Note that this analysis must be repeated for all features j and all classes d.
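To illustrate how a closed-form $\delta$ can be obtained from Eq. (3), the sketch below sets the derivative of the bound to zero on each side of the kink of the absolute value, and falls back to $\delta = -\lambda_{d,j}$ (which zeroes the weight) when neither stationary point is admissible. This is a derivation made here under the assumptions $0 < q_\lambda[f_{d,j}] < 1$ and $\beta_{d,j} > 0$. It is not necessarily the procedure used in the authors' implementation, and the names `bound` and `best_delta` are hypothetical.

```python
import math

def bound(delta, p_emp, q_model, lam, beta):
    """Right-hand side of Eq. (3) for a single feature."""
    return (-delta * p_emp
            + math.log(1.0 + (math.exp(delta) - 1.0) * q_model)
            + beta * (abs(lam + delta) - abs(lam)))

def best_delta(p_emp, q_model, lam, beta):
    """Minimizer of the bound over delta (assumes 0 < q_model < 1 and beta > 0)."""
    def stationary(target):
        # Solves q e^d / (1 + (e^d - 1) q) = target, which requires 0 < target < 1.
        return math.log(target * (1.0 - q_model) / (q_model * (1.0 - target)))

    # Branch lam + delta >= 0: the regularizer contributes +beta to the derivative.
    target = p_emp - beta
    if 0.0 < target < 1.0:
        delta = stationary(target)
        if lam + delta >= 0.0:
            return delta
    # Branch lam + delta <= 0: the regularizer contributes -beta to the derivative.
    target = p_emp + beta
    if 0.0 < target < 1.0:
        delta = stationary(target)
        if lam + delta <= 0.0:
            return delta
    # Neither branch admits a stationary point: the minimum sits at the kink.
    return -lam
```

Because the bound is convex in $\delta$, exactly one of the three cases applies. The fallback case is where the L1 penalty produces sparsity: when the empirical and model means differ by less than $\beta_{d,j}$, the weight is driven to zero.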

Efficient implementations of the sequential-update algorithm require storing numerous variables and
intermediate computations. For instance, we need to store all the $q_\lambda(x_i, c)$ and $\lambda \cdot f(x_i, c_i)$. The storage
requirement in O(m × n) can be problematic for large-scale problems.

As a matter of fact, we found memory requirements to be the main limitation of this implementation of
multiclass SL1-Max. Speedup techniques based on partial pricing strategies [5] have reduced the learning
time of SL1-Max and made it manageable.

3.2 Estimating class-conditional distributions 

The motivation for using the class-conditional distribution p(x|c) is that it allows us to build one model per
class. From Eq. (2), it is easy to see that for $d \neq c$, features $f_{d,j}$ have no impact on the class-conditional
distribution $p_c$ we are trying to estimate. As a result, separate optimization problems can be defined for the
l classifiers with no interaction between them. For each class, the convex problem $P_2$ and its convex dual
$Q_2(\lambda_c)$ are:

\[ P_2: \quad \max_{p_c} H(p_c) \quad \text{subject to} \quad \sum_i p_c(x_i) = 1, \qquad \forall j: \; |\tilde{p}_c[f_{c,j}] - p_c[f_{c,j}]| \le \beta_{c,j} \]

\[ Q_2(\lambda_c): \quad \min_{\lambda_c} L_\beta^{\tilde{p}_c}(\lambda_c) \quad \text{with} \quad Z_\lambda(c) = \sum_i e^{\lambda \cdot f(x_i, c)} \]

For each class c, we have an independent optimization problem to solve. Each of these optimization
problems can be solved with the general SL1-Max algorithm.

A clear advantage of the class-conditional over the joint distribution approach is that, when optimizing
$p_c$, we do not have to store the variables used to optimize the class-conditional distributions of the other
classes. This provides huge savings in memory (i.e., the memory requirement is divided by the number of
classes).

A drawback of the class-conditional approach is that it does not explicitly minimize the classification
error rate. To obtain the recognized class, it relies on the application of Bayes' rule, and thus on the fact that
the probability distributions have been properly estimated. Taking the logarithm of $\arg\max_c p(c)\, q_\lambda(x, c)$,
this class is$^1$ $\arg\max_c \left( \lambda \cdot f(x, c) - \ln Z_\lambda(c) + \ln p(c) \right)$.
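The decision rule above can be sketched in a few lines of Python. The sketch assumes that the per-class weights $\lambda_c$ have already been estimated, that $Z_\lambda(c)$ is computed over the training inputs as in problem $Q_2(\lambda_c)$, and that class priors are given. The function names `log_normalizer` and `classify` are illustrative, and classes are indexed from 0.

```python
import math

def log_normalizer(lam_c, train_inputs):
    """ln Z_lambda(c) = ln sum_i exp(lambda_c . v(x_i)), computed in a stable way."""
    scores = [sum(w * v for w, v in zip(lam_c, x)) for x in train_inputs]
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))

def classify(x, lams, log_z, log_prior):
    """Return argmax_c (lambda_c . v(x) - ln Z_lambda(c) + ln p(c))."""
    def score(c):
        return sum(w * v for w, v in zip(lams[c], x)) - log_z[c] + log_prior[c]
    return max(range(len(lams)), key=score)

# Toy setup: two classes, two training inputs, equal priors (all values are made up).
train_inputs = [[1.0, 0.0], [0.0, 1.0]]
lams = [[2.0, 0.0], [0.0, 2.0]]
log_z = [log_normalizer(lam_c, train_inputs) for lam_c in lams]
log_prior = [math.log(0.5), math.log(0.5)]
predicted = classify([1.0, 0.0], lams, log_z, log_prior)
```

Unlike the joint and conditional cases, the normalization term differs from class to class here, so an error in any $\ln Z_\lambda(c)$ directly shifts the decision boundary.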



3.3 Estimating conditional distributions 



In this section, we propose a novel extension to the SL1-Max algorithm to estimate the parameters of the
maxent model for the conditional distribution p(c|x). In the literature [16][17], conditional maxent is typically
the only distribution considered for classification, as it is expected to be the most discriminant. However,
its optimization turns out to be more complex, so we present it last.

In the case of conditional distributions, the main challenge is that, for each training sample i, we want to
estimate one separate distribution over the classes $p_i$. At the same time, the constraints apply to the entire
training set and tie up these distributions. If we trained each distribution separately for each sample i,
the constraints would be $\forall d, j: |\tilde{p}_i[f_{d,j}] - p_i[f_{d,j}]| \le \beta_{d,i,j}$. This would result in $m \times n \times l$ learnable
parameters, with obvious overfitting. On the other hand, summing these constraints over the examples produces
$\left| \frac{1}{m} \sum_i p_i[f_{d,j}] - \tilde{p}[f_{d,j}] \right| \le \beta_{d,j}$. This formulation was used before [16], but we added regularization and a
different sequential-update algorithm.
The two optimization problems are:

\[ P_3: \quad \max_{p_1, \dots, p_m} \sum_i H(p_i) \quad \text{subject to} \quad \forall i: \sum_c p_i(c) = 1, \qquad \forall d, j: \; \left| \frac{1}{m} \sum_i p_i[f_{d,j}] - \tilde{p}[f_{d,j}] \right| \le \beta_{d,j} \]

$^1$ Note that this relies on a good estimation of the $Z_\lambda(c)$ normalization factors. With a joint distribution, there is no need to
compute the $Z_\lambda$ normalization factor and $\arg\max_c q_\lambda(x, c) = \arg\max_c \lambda \cdot f(x, c)$.
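\[ Q_3(\lambda): \quad \min_\lambda L_\beta^{\tilde{p}}(\lambda) \quad \text{with} \quad L_\beta^{\tilde{p}}(\lambda) = -\frac{1}{m} \sum_i \ln q_\lambda(x_i, c_i) + \sum_{d,j} \beta_{d,j} |\lambda_{d,j}|, \qquad q_\lambda(x_i, c) = \frac{e^{\lambda \cdot f(x_i, c)}}{Z_\lambda(x_i)}, \qquad Z_\lambda(x_i) = \sum_c e^{\lambda \cdot f(x_i, c)} \]

The likelihood can be expanded as

\[ L_\beta^{\tilde{p}}(\lambda) = -\sum_{d,j} \lambda_{d,j}\, \tilde{p}[f_{d,j}] + \frac{1}{m} \sum_i \ln Z_\lambda(x_i) + \sum_{d,j} \beta_{d,j} |\lambda_{d,j}| \tag{4} \]

The novelty in problem $Q_3(\lambda)$ involves having one normalization constant per example. In the development
of the likelihood, a single logarithm $\ln Z_\lambda$ is replaced by the sum $\frac{1}{m} \sum_i \ln Z_\lambda(x_i)$. To bound
$L_\beta^{\tilde{p}}(\lambda') - L_\beta^{\tilde{p}}(\lambda)$, the most difficult step is to bound:

\begin{align}
& \frac{1}{m} \sum_i \ln \frac{Z_{\lambda'}(x_i)}{Z_\lambda(x_i)} \tag{5} \\
&= \frac{1}{m} \sum_i \ln \left( 1 + \left( e^{\delta f_{d,j}(x_i, d)} - 1 \right) q_\lambda(x_i, d) \right) \tag{6} \\
&\le \frac{1}{m} \sum_i \ln \left( 1 + (e^\delta - 1)\, q_\lambda(x_i, d)\, f_{d,j}(x_i, d) \right) \tag{7} \\
&\le \ln \left( 1 + (e^\delta - 1)\, q'_\lambda[f_{d,j}] \right) \tag{8}
\end{align}

where $q'_\lambda[f_{d,j}] = \frac{1}{m} \sum_i q_\lambda(x_i, d)\, f_{d,j}(x_i, d)$.

Eq. (7) uses $e^{\delta f} \le 1 + (e^\delta - 1) f$
for $f_{d,j}(x_i, d) \in [0, 1]$, with equality if $f_{d,j}(x_i, d) \in \{0, 1\}$. Eq. (8) relies on the concavity of the log function
to apply Jensen's inequality.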

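We have established here a new bound on the decrease in the loss for L1-regularized conditional maxent
models:

\[ L_\beta^{\tilde{p}}(\lambda') - L_\beta^{\tilde{p}}(\lambda) \le -\delta\, \tilde{p}[f_{d,j}] + \ln\left(1 + (e^\delta - 1)\, q'_\lambda[f_{d,j}]\right) + \beta_{d,j} \left( |\lambda_{d,j} + \delta| - |\lambda_{d,j}| \right). \tag{9} \]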



Because of its similarity to the standard SL1-Max bound (Eq. (3)), it allows a simple generalization of the
SL1-Max algorithm to conditional distributions by replacing $q_\lambda[f_{d,j}]$ with $q'_\lambda[f_{d,j}]$. Our experiments in
the case of binary features show that the bound given in Eq. (9) is very tight. It can be used to obtain, in
closed form, a value for $\delta$ that is close to the optimum.
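The tightness of the conditional bound can be explored numerically. The sketch below compares the exact change in the averaged log-normalizer with the Jensen upper bound built from $q'_\lambda[f_{d,j}]$, for one weight $\lambda_{d,j}$ and feature values in [0, 1]. The function names and the toy scores are assumptions made for illustration, and classes are indexed from 0.

```python
import math

def class_posteriors(scores):
    """q_lambda(x_i, c) for one example, from its per-class scores lambda_c . v(x_i)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def exact_change(score_rows, values, d, delta):
    """(1/m) sum_i ln(Z_lambda'(x_i) / Z_lambda(x_i)) when lambda_{d,j} grows by delta."""
    total = 0.0
    for scores, v in zip(score_rows, values):
        q_d = class_posteriors(scores)[d]
        total += math.log(1.0 + q_d * (math.exp(delta * v) - 1.0))
    return total / len(score_rows)

def bound_change(score_rows, values, d, delta):
    """ln(1 + (e^delta - 1) q'_lambda[f_{d,j}]), the upper bound on the same quantity."""
    q_prime = sum(class_posteriors(s)[d] * v
                  for s, v in zip(score_rows, values)) / len(score_rows)
    return math.log(1.0 + (math.exp(delta) - 1.0) * q_prime)

# Three examples, three classes; values[i] is v_j(x_i), assumed to lie in [0, 1].
score_rows = [[0.2, -1.0, 0.5], [1.5, 0.0, -0.3], [0.0, 0.0, 0.0]]
values = [1.0, 0.5, 0.0]
gap = bound_change(score_rows, values, 0, 0.5) - exact_change(score_rows, values, 0, 0.5)
```

The gap is non-negative for any sign of $\delta$. It comes from two sources: non-binary feature values, and variation of $q_\lambda(x_i, d)$ across examples, which is what the Jensen step averages over.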
In the case of conditional maxent, it is instructive to compare this algorithm to Improved Iterative Scaling
(IIS) [4], which also updates the parameters to maximize a bound on the decrease in the loss. First, the
bounds are significantly different. The SL1-Max bound is tighter because it requires only a single $\lambda_{d,j}$ to be
modified at a time. Second, while both approaches support closed form solutions under specific conditions,
these conditions are very different: the features must be binary in the case of SL1-Max, and they must add
up to a constant value in the case of IIS. Finally, to our knowledge, there is no simple modification of IIS to
handle L1 regularization.



4 Multi-label categorization 

The fundamental modeling assumption we have made so far implies that each example i only carries a
single label $c_i$. However, in many classification problems, a given input can correspond to multiple labels
(multi-label).

For simplicity, we assume that there is no form of ranking or preference [1] among the labels. Our
sample space covers input-code pairs $(x, y) \in X \times \{0, 1\}^l$, where y is a binary output code.

Class-conditional distributions represent the easiest way to deal with multiple labels. We only focus
on the estimation of $p_c$, and the fact that there are multiple labels can be ignored. However, the final
classification decision will require a multiplication by p(c), which is not defined as a probability distribution
because $\sum_c p(c) > 1$.

This section reviews two other techniques to handle multiple labels that only require minimal
modifications of the algorithms proposed so far, and shows their limitations.



4.1 Duplicating training examples 

Assume the training data is a set of input-code pairs $(x_1, y_1), \dots, (x_m, y_m)$. Our goal is to project this
training set in the smaller input-label sample space $\{x_1, \dots, x_m\} \times \{1, \dots, l\}$. For each training sample
$(x_i, y_i)$ in the input-code space, we build $K_i$ samples in the input-label space, with
$K_i = |\{1 \le k \le l : y_i[k] = 1\}|$. The conditional probability function is:

\[ \tilde{p}(y[k] = 1 \,|\, x_i) = \begin{cases} 1/K_i & \text{if } y_i[k] = 1 \\ 0 & \text{otherwise} \end{cases} \tag{10} \]

The problem with this approach is that the empirical distribution $\tilde{p}$ is reweighted to favor examples with
multiple labels, with $\tilde{p}(x_i) = K_i / \sum_j K_j$ (we assume here that $\forall i \neq j: x_i \neq x_j$).
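The reweighting effect is easy to see on a toy example. The following Python sketch duplicates multi-label examples into single-label pairs and computes the resulting empirical weight of each input. It assumes distinct, hashable inputs; the function names are illustrative and labels are indexed from 0.

```python
def duplicate_examples(inputs, codes):
    """Turn each (x_i, y_i) into K_i single-label pairs (x_i, k), one per k with y_i[k] = 1."""
    return [(x, k) for x, y in zip(inputs, codes) for k, bit in enumerate(y) if bit == 1]

def empirical_input_weights(inputs, codes):
    """Empirical weight of each input after duplication: K_i divided by the sum of all K."""
    pairs = duplicate_examples(inputs, codes)
    return [sum(1 for x, _ in pairs if x == xi) / len(pairs) for xi in inputs]

inputs = ["a", "b"]
codes = [[1, 1, 0], [0, 0, 1]]     # "a" carries two labels, "b" carries one
pairs = duplicate_examples(inputs, codes)
weights = empirical_input_weights(inputs, codes)
```

Here the two-label input receives twice the weight of the single-label input, although both occur once in the original training set.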



4.2 Using a non-class model 

With an output code that represents l binary decisions, a trivial solution is to build l binary classifiers. This
is typically what l 1-vs-other classifiers do, and, in the case of maxent, one may think that the l
class-conditional classifiers described in Section 3.2 represent such a solution.

However, the statistical reality is more complex; an independence assumption between each binary
output is necessary:

\[ p(y|x) = p(y^1, \dots, y^l \,|\, x) = \prod_c p(y^c|x) \tag{11} \]

Under this assumption, one can estimate each $p(y^c|x)$ independently. For each class c, we introduce the
distributions $p_i^c(y) = p^c(y|x_i) = p(y^c = y \,|\, x_i)$, where $y \in \{0, 1\}$ is the index for a secondary classification
problem between examples belonging to class c and the other examples (which are said to be part of the
non-class of c). The independence assumption can be rewritten as $p_i(y) = \prod_c p_i^c(y^c)$, and the overall entropy
can be decomposed into one entropy per class:

\[ H(p_i) = -\sum_{y \in \{0,1\}^l} p_i(y) \ln p_i(y) = \sum_c H(p_i^c) \tag{12} \]

As the entropy of each distribution $p^c$ can be maximized separately, we have l binary
maxent models to estimate. Conditional maxent has previously been applied to binary "question answering"
problems [18].

The framework defined by problems $P_3$ and $Q_3(\lambda)$ to produce the set of parameters $\lambda$ can be applied
here to the distribution $p^c$. The transformation of notation is described in the following table:

    Distribution    $p$                        $p^c$
    Problem         $Q_3(\lambda)$             $Q_3(\lambda^c)$
    Parameters      $\lambda$                  $\lambda^c$
    Features        $f(x, c)$                  $f^c(x, y)$
    with            $c \in \{1, \dots, l\}$    $y \in \{0, 1\}$


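The exponential distribution that solves the dual problem $Q_3(\lambda^c)$ takes the form:

\[ q_{\lambda^c}(x, y) = \frac{1}{Z_{\lambda^c}(x)}\, e^{\lambda^c \cdot f^c(x, y)} \]

The simplifying assumption of Eq. (2) becomes here:

\[ f^c_{y',j}(x, y) = \begin{cases} v_j(x) & \text{if } y = y' \\ 0 & \text{otherwise} \end{cases} \tag{13} \]

Thus $\lambda^c \cdot f^c(x, y) = \lambda^c_y \cdot v(x)$ and the probability of observing class c becomes

\[ q_{\lambda^c}(x, 1) = \sigma\left( (\lambda^c_1 - \lambda^c_0) \cdot v(x) \right) \tag{14} \]

where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function. Note that for each feature $v_j(x)$, classifier c has two
parameters: $\lambda^c_{1,j}$ used in the positive model and $\lambda^c_{0,j}$ used in the negative model. Given a test input x, this approach
can be used either to produce the code vector y such that $q_{\lambda^c}(x, y^c) > 0.5$ or the top class $\arg\max_c q_{\lambda^c}(x, 1)$.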

Eq. (14) suggests that, in the binary case, conditional maxent amounts to logistic regression. The use
of L1 regularization in logistic regression was recently analyzed [15]. Another technique to optimize the
logistic loss that relies on an implicit L1 regularization is AdaBoost with logistic loss [7], which also uses a
sequential update procedure similar to SL1-Max.
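The equivalence with logistic regression can be verified directly: normalizing the two exponential scores of the positive and negative models gives the sigmoid of their difference. The Python sketch below checks this on made-up weights; the function names and numerical values are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_maxent_prob(lam_pos, lam_neg, v):
    """Probability of y = 1 under the two-score exponential model of one class."""
    s_pos = sum(w * x for w, x in zip(lam_pos, v))   # score of the positive model
    s_neg = sum(w * x for w, x in zip(lam_neg, v))   # score of the negative (non-class) model
    top = max(s_pos, s_neg)                          # subtract the max for numerical stability
    e_pos, e_neg = math.exp(s_pos - top), math.exp(s_neg - top)
    return e_pos / (e_pos + e_neg)

lam_pos, lam_neg = [0.8, -0.2, 0.1], [-0.3, 0.4, 0.0]
v = [1.0, 2.0, 0.5]
diff = [p - n for p, n in zip(lam_pos, lam_neg)]
as_logistic = sigmoid(sum(w * x for w, x in zip(diff, v)))
as_maxent = binary_maxent_prob(lam_pos, lam_neg, v)
```

Only the difference between the positive and negative weights affects the output, so the two-parameter form is redundant for prediction; with an L1 penalty on each weight, however, the two parameterizations are not regularized identically.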

While the independence assumption is not as straightforward, we can also transpose problem $P_1$ to the
distributions $p^c(x, y) = p(x, y^c = y)$ and obtain the convex dual $Q_1(\lambda^c)$.



5 Implementation: it's all about normalization 

This section shows that from an implementation viewpoint, normalization is the main differentiator in the 
algorithms described in this paper. 

We have already noted that SL1-Max is strikingly similar to AdaBoost, especially when AdaBoost is
described within the Bregman distance framework [7]. As a matter of fact, unregularized conditional maxent
was shown [13] to be equivalent to AdaBoost with the additional constraint $\sum_c p_i(c) = 1$.
    Assume         Implement      Maxent         Optim.               Z?
    Classes        Classifiers    Distrib.       problem
    Exclusive      Tied           Joint          $Q_1(\lambda)$       No
    Exclusive      Tied           Conditional    $Q_3(\lambda)$       No
    Exclusive      Separate       ClassCond      $Q_2(\lambda_c)$     Yes
    Independent    Separate       Joint          $Q_1(\lambda^c)$     No
    Independent    Separate       Conditional    $Q_3(\lambda^c)$     No
    Independent    Separate       AdaBoost                            No

Table 1: Impact of the model on the implementation. The last column indicates whether the computation of the
normalization Z is required to perform classification.

Our implementation of SL1-Max builds on an earlier implementation of AdaBoost, extended to include a
normalization constant. The joint model requires normalization over all the classes and examples. The
class-conditional model requires, for a given class, normalization over all the examples. The conditional
model requires, for a given example, normalization over all the classes.
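The three normalization schemes differ only in which axis of the score matrix is summed. A minimal Python sketch, assuming a dense matrix `scores[i][c]` holding $\lambda \cdot f(x_i, c)$ (a real implementation would exploit sparsity and work in the log domain), is:

```python
import math

def normalizers(scores):
    """scores[i][c] = lambda . f(x_i, c) for example i and class c.

    Returns the joint normalizer (a single number), the class-conditional
    normalizers (one per class) and the conditional normalizers (one per example).
    """
    num_classes = len(scores[0])
    z_joint = sum(math.exp(s) for row in scores for s in row)
    z_class = [sum(math.exp(row[c]) for row in scores) for c in range(num_classes)]
    z_example = [sum(math.exp(s) for s in row) for row in scores]
    return z_joint, z_class, z_example

# Three examples, two classes; the scores are made up so that the sums are easy to read.
scores = [[0.0, math.log(2.0)], [0.0, 0.0], [math.log(3.0), 0.0]]
z_joint, z_class, z_example = normalizers(scores)
```

The joint normalizer needs every score at once, whereas each class-conditional normalizer only needs one column. This is why the class-conditional models can be trained one class at a time with a fraction of the memory.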

Our implementation of multiclass SL1-Max, which derives directly from Sections 3 and 4.2, can be
interpreted as a combination of 1-vs-other classifiers. We have not explored output codes or hierarchical
structures, though multiclass SL1-Max offers a promising framework to explore these approaches, where
the class-independent hypotheses would be much closer to reality. The most important difference between
the various modeling hypotheses is whether the classifiers can be trained separately, or are tied by shared
normalization constants. Training classifiers separately can be done on parallel computers or sequentially;
in either case, the implementation is more memory efficient in a way which is critical when the number of
classes is very large.

There is some merit in training the classifiers together: one can minimize a unique target function 
and monitor the training process in a variety of ways. The most common stopping criterion is when the 
classification error minimum is reached on validation data. When training classifiers separately, the absence 
of a single stopping criterion makes the process much harder to monitor. Table 1 summarizes the merit of
each modeling assumption from an implementation viewpoint.

An implementation of SL1-Max that is optimized using partial pricing strategies [5] is provided in the
(blanked out) software package. When classifiers can be trained separately, an SL1-Max binary classifier is
just another 1-vs-other classifier that can be used instead of an AdaBoost or an SVM classifier. On a given
classification learning task, choosing between SL1-Max, AdaBoost and SVM can be done with a single
switch, or automatically by using cross-validation data. Systematic experimental comparisons between the
three approaches for large scale natural language understanding tasks [6] indicate that SL1-Max is the fastest
approach on datasets larger than 100,000 examples, with state-of-the-art accuracy$^2$.

It would be informative to compare SL1-Max to algorithms considered as the state of the art for the
estimation of parameters in conditional entropy models. They include Iterative Scaling algorithms, such as
IIS [4] and Fast Iterative Scaling (FIS) [11], and gradient algorithms [14]. This study, which would be of
considerable interest, is beyond the scope of this paper.

The two key factors that contribute to the remarkable learning speed of SL1-Max have not been, to
our knowledge, applied to most algorithms in the iterative scaling family. First, SL1-Max is based on a
pricing strategy: modify the single parameter which causes the greatest decrease in the objective function.
The addition of partial pricing can make the search for the parameter considerably faster. Second, the L1
regularization adds some slack in the constraints and makes them easier to satisfy early in the optimization
process.

$^2$ They are only outperformed by SVMs with polynomial kernels, which are not computationally practical because of the large
number of support vectors.

                        Reuters    WebKB    SuperTags
    Multi-label?        Yes        No       No
    Train size          9603       3150     950028
    Test size           3299       4199     46451
    Num. of labels      90         4        4726
    Num. of features    22758      25229    95516
    Features/sample     126.7      129.1    18.8

Table 2: Key characteristics of the three datasets used in the experiments. The last line gives the average
number of non-zero features per training vector.

Results on the WebKB text classification task show that SL1-Max takes less than 10 seconds to learn
3150 examples with 25,000 features, which compares favorably to more than 100 seconds when using FIS
or IIS on a reduced set of 300 features [11] (we assume comparable Pentium CPUs with a 2 GHz clock).

6 Experiments and Discussions 

The first two datasets are small enough to allow us to run the methods $Q_1(\lambda)$ and $Q_3(\lambda)$, which are
compared to $Q_2(\lambda_c)$, $Q_1(\lambda^c)$, and $Q_3(\lambda^c)$. The Reuters-21578$^3$ dataset contains stories collected from the
Reuters newswire in 1987. We used the ModApte split between 9603 train stories and 3299 test stories. This is a
multi-label problem, where a story can carry up to 15 labels. The WebKB$^4$ dataset contains
web pages gathered from university computer science departments. We selected the same categories
as [16]: student, faculty, courses and projects. The 4199 samples are split between training and testing using
4-fold cross-validation.

The third dataset, which demonstrates the scaling ability of SL1-Max, is much larger, and only the
methods $Q_2(\lambda_c)$, $Q_1(\lambda^c)$, and $Q_3(\lambda^c)$ can be applied. It consists of a set of SuperTags. SuperTags are
extensions of part-of-speech tags that encode morpho-syntactic constraints fTl) and are derived from the 
phrase-structure annotated Penn TreeBank. The characteristics of the three datasets are summarized in
Table 2.

Table 3 compares the five different multiclass SL1-Max models considered in Sections 3 and 4.2. The
SL1-Max regularization parameter is set to $\beta = 0.5$. AdaBoost with logistic loss and linear SVMs are
provided as a baseline (note that our implementation of AdaBoost can be considered as a class-independent
model with separate classifiers). The best error rate we obtain on WebKB (7.1%) and the best F-measure
we obtain on Reuters (86.7%) compare favorably to the literature [16][12]. The training speeds of AdaBoost
and SL1-Max are very similar and can be optimized using the same techniques. They are not reported here
as a detailed comparison is reported elsewhere [5].

How good are class-conditional models? The most computationally efficient model is the
class-conditional model $Q_2(\lambda_c)$. Table 4, which compares the computational efficiency of the $Q_2(\lambda_c)$, $Q_1(\lambda^c)$,
and $Q_3(\lambda^c)$ models, shows that it has the smallest training time and the smallest number of parameters.
However, its error rate on the smaller datasets is higher due to the estimation of the Z normalization constant.

$^3$ http://www.daviddlewis.com/resources/testcollections/reuters21578/

$^4$ http://www.cs.cmu.edu/~webkb/

                          Reuters      WebKB    SuperTags
    AdaBoost              14.9/86.7    7.10     12.0
    Linear SVM            15.3/86.6    7.57     -
    $Q_1(\lambda)$        16.9/84.0    7.95     -
    $Q_2(\lambda_c)$      17.4/83.9    8.19     11.2
    $Q_3(\lambda)$        16.6/79.6    7.76     -
    $Q_1(\lambda^c)$      15.0/86.5    7.50     11.9
    $Q_3(\lambda^c)$      14.1/86.4    7.45     11.1

Table 3: Error rates on the 3 datasets. For Reuters, which is multi-label, the first number is the top-class
error rate (an example is considered an error if the highest scoring class given by the classifier is not part of
the target labels) and the second number is the micro-averaged optimal F-measure.

                          # parameters (thousands)    Train time (hours)
    $\beta$               0.5         0.9             0.5         0.9
    $Q_1(\lambda^c)$      53          35              15.65       16.09
    $Q_2(\lambda_c)$      53          34              14.92       7.58
    $Q_3(\lambda^c)$      41          24              46.65       46.40

Table 4: Number of non-zero model parameters and training time for the SuperTags set as a function of the
regularizer $\beta$ and the optimization method. On this large set, $\beta$ has little impact on accuracy, and mostly
affects speed and sparsity.

Class-exclusive vs. independent assumptions: The "Reuters" column of Table 3 indicates that making
a class-exclusive assumption when it is not justified (e.g., the Reuters data is multi-label) leads to a
significant loss in performance. By contrast, the class-independent assumption is never true, but a combination
of binary maxent classifiers, which relies on this assumption, consistently improves performance. It also
greatly improves training speed by allowing parallelization and a small memory footprint. A combination
of binary classifiers yields excellent classification accuracy regardless of the size of the problem. However,
comparisons on a "Question Answering" problem [18] suggest that they may not perform well for ranking
tasks. Future work on multi-label tasks will also assess the ranking performance with specific error
measures [1].

Pros and cons of conditional models: For pure classification, the fully discriminant conditional models
($Q_3(\lambda)$ and $Q_3(\lambda^c)$) yield the best results. This may justify the exclusive use of conditional models in all
previous studies of multiclass maxent. Table 4 shows another reason to prefer conditional models: they
are sparser and require fewer model parameters (since they only focus on performing class discrimination).
Conditional models have two major drawbacks. First, as their outputs are normalized separately for each
example, they tend to be poor confidence estimators. Metrics based on the comparison of the output to a
varying threshold tend to fare poorly. For instance, in Table 3, the conditional model $Q_3(\lambda)$ yields the lowest
F-measure for Reuters (79.6). Second, as shown in Table 4, they can require more training time.
7 Conclusions

We have shown that a sequential maxent algorithm (SL1-Max) can be applied to many classification
problems with performance comparable to AdaBoost and SVMs. An important (and apparently
under-appreciated) advantage of maxent for classification problems appears to be its remarkable flexibility
in terms of modeling assumptions. In future work, this flexibility will be used to optimize maxent for
problems where ranking or rejection performance is critical, and to which traditional classification methods
are problematic to adapt.